Distance based algorithms for small biomolecule classification and structural similarity search

نویسندگان

  • Emre Karakoç
  • Artem Cherkasov
  • Süleyman Cenk Sahinalp
چکیده

MOTIVATION Structural similarity search among small molecules is a standard tool used in molecular classification and in-silico drug discovery. The effectiveness of this general approach depends on how well the following problems are addressed. The notion of similarity should be chosen for providing the highest level of discrimination of compounds wrt the bioactivity of interest. The data structure for performing search should be very efficient as the molecular databases of interest include several millions of compounds. RESULTS In this paper we focus on the k-nearest-neighbor search method, which, until recently was not considered for small molecule classification. The few recent applications of k-nn to compound classification focus on selecting the most relevant set of chemical descriptors which are then compared under standard Minkowski distance L(p). Here we show how to computationally design the optimal weighted Minkowski distance wL(p) for maximizing the discrimination between active and inactive compounds wrt bioactivities of interest. We then show how to construct pruning based k-nn search data structures for any wL(p) distance that minimizes similarity search time. The accuracy achieved by our classifier is better than the alternative LDA and MLR approaches and is comparable to the ANN methods. In terms of running time, our classifier is considerably faster than the ANN approach especially when large data sets are used. Furthermore, our classifier quantifies the level of bioactivity rather than returning a binary decision and thus is more informative than the ANN approach.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An improved opposition-based Crow Search Algorithm for Data Clustering

Data clustering is an ideal way of working with a huge amount of data and looking for a structure in the dataset. In other words, clustering is the classification of the same data; the similarity among the data in a cluster is maximum and the similarity among the data in the different clusters is minimal. The innovation of this paper is a clustering method based on the Crow Search Algorithm (CS...

متن کامل

A partition-based algorithm for clustering large-scale software systems

Clustering techniques are used to extract the structure of software for understanding, maintaining, and refactoring. In the literature, most of the proposed approaches for software clustering are divided into hierarchical algorithms and search-based techniques. In the former, clustering is a process of merging (splitting) similar (non-similar) clusters. These techniques suffered from the drawba...

متن کامل

A New Hybrid Method for Web Pages Ranking in Search Engines

There are many algorithms for optimizing the search engine results, ranking takes place according to one or more parameters such as; Backward Links, Forward Links, Content, click through rate and etc. The quality and performance of these algorithms depend on the listed parameters. The ranking is one of the most important components of the search engine that represents the degree of the vitality...

متن کامل

IFSB-ReliefF: A New Instance and Feature Selection Algorithm Based on ReliefF

Increasing the use of Internet and some phenomena such as sensor networks has led to an unnecessary increasing the volume of information. Though it has many benefits, it causes problems such as storage space requirements and better processors, as well as data refinement to remove unnecessary data. Data reduction methods provide ways to select useful data from a large amount of duplicate, incomp...

متن کامل

A heuristic approach for multi-stage sequence-dependent group scheduling problems

We present several heuristic algorithms based on tabu search for solving the multi-stage sequence-dependent group scheduling (SDGS) problem by considering minimization of makespan as the criterion. As the problem is recognized to be strongly NP-hard, several meta (tabu) search-based solution algorithms are developed to efficiently solve industry-size problem instances. Also, two different initi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Bioinformatics

دوره 22 14  شماره 

صفحات  -

تاریخ انتشار 2006